Time series analysis (Pandas)

Nikolay Koldunov

koldunovn@gmail.com

This is part of Python for Geosciences notes.

================

Here I am going to show just some basic pandas stuff for time series analysis, as I think for the Earth Scientists it's the most interesting topic. If you find this small tutorial useful, I encourage you to watch this video, where Wes McKinney give extensive introduction to the time series data analysis with pandas.

On the official website you can find explanation of what problems pandas solve in general, but I can tell you what problem pandas solve for me. It makes analysis and visualisation of 1D data, especially time series, MUCH faster. Before pandas working with time series in python was a pain for me, now it's fun. Ease of use stimulate in-depth exploration of the data: why wouldn't you make some additional analysis if it's just one line of code? Hope you will also find this great tool helpful and useful. So, let's begin.

As an example we are going to use time series of Arctic Oscillation (AO) and North Atlantic Oscillation (NAO) data sets.

Module import

First we have to import necessary modules:


In [1]:
import pandas as pd
import numpy as np
pd.set_option('max_rows',15) # this limit maximum numbers of rows

And "switch on" inline graphic for the notebook:


In [2]:
%matplotlib inline

Pandas developing very fast, and while we are going to use only basic functionality, some details may still change in the newer versions.


In [3]:
pd.__version__


Out[3]:
'0.16.1'

Loading data

Now, when we are done with preparations, let's get some data. If you work on Windows download monthly AO data from here. If you on *nix machine, you can do it directly from ipython notebook using system call to wget command:


In [ ]:
!wget http://www.cpc.ncep.noaa.gov/products/precip/CWlink/daily_ao_index/monthly.ao.index.b50.current.ascii

Pandas has very good IO capabilities, but we not going to use them in this tutorial in order to keep things simple. For now we open the file simply with numpy loadtxt:


In [5]:
ao = np.loadtxt('monthly.ao.index.b50.current.ascii')

Every line in the file consist of three elements: year, month, value:


In [6]:
ao[0:2]


Out[6]:
array([[  1.95000000e+03,   1.00000000e+00,  -6.03100000e-02],
       [  1.95000000e+03,   2.00000000e+00,   6.26810000e-01]])

And here is the shape of our array (note that shape of the file might differ in your case, since data updated monthly):


In [7]:
ao.shape


Out[7]:
(785, 3)

Time Series

We would like to convert this data in to time series, that can be manipulated naturally and easily. First step, that we have to do is to create the range of dates for our time series. From the file it is clear, that record starts at January 1950 and ends at September 2013 (at the time I am writing this, of course). You have to adjust the last date according to values in your file! Frequency of the data is one month (freq='M').


In [8]:
dates = pd.date_range('1950-01', '2014-01', freq='M')

As you see syntax is quite simple, and this is one of the reasons why I love Pandas so much :) Another thing to mention, is that we put October 2003 instead of September because the interval is open on the right side. You can check if the range of dates is properly generated:


In [9]:
dates


Out[9]:
DatetimeIndex(['1950-01-31', '1950-02-28', '1950-03-31', '1950-04-30',
               '1950-05-31', '1950-06-30', '1950-07-31', '1950-08-31',
               '1950-09-30', '1950-10-31', 
               ...
               '2013-03-31', '2013-04-30', '2013-05-31', '2013-06-30',
               '2013-07-31', '2013-08-31', '2013-09-30', '2013-10-31',
               '2013-11-30', '2013-12-31'],
              dtype='datetime64[ns]', length=768, freq='M', tz=None)

In [10]:
dates.shape


Out[10]:
(768,)

Now we are ready to create our first time series. Dates from the dates variable will be our index, and AO values will be our, hm... values. We are going to use data only untill the end of 2013:


In [11]:
AO = pd.Series(ao[:768,2], index=dates)

In [12]:
AO


Out[12]:
1950-01-31   -0.060310
1950-02-28    0.626810
1950-03-31   -0.008127
1950-04-30    0.555100
1950-05-31    0.071577
1950-06-30    0.538570
1950-07-31   -0.802480
                ...   
2013-06-30    0.548650
2013-07-31   -0.011112
2013-08-31    0.154250
2013-09-30   -0.460880
2013-10-31    0.262760
2013-11-30    2.029000
2013-12-31    1.474900
Freq: M, dtype: float64

Now we can plot complete time series:


In [13]:
AO.plot()


Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x1071d07d0>

or its part:


In [14]:
AO['1980':'1990'].plot()


Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x107424450>

or even smaller part:


In [15]:
AO['1980-05':'1981-03'].plot()


Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x1077b7310>

Reference to the time periods is done in a very natural way. You, of course, can also get individual values. By number:


In [16]:
AO[120]


Out[16]:
-2.4842

or by index (date in our case):


In [17]:
AO['1960-01']


Out[17]:
1960-01-31   -2.4842
Freq: M, dtype: float64

And what if we choose only one year?


In [18]:
AO['1960']


Out[18]:
1960-01-31   -2.484200
1960-02-29   -2.212400
1960-03-31   -1.624600
1960-04-30   -0.297310
1960-05-31   -0.857430
1960-06-30    0.054978
1960-07-31   -0.619060
1960-08-31   -1.007900
1960-09-30   -0.381640
1960-10-31   -1.187000
1960-11-30   -0.553230
1960-12-31   -0.342950
Freq: M, dtype: float64

Isn't that great? :)

One bonus example :)


In [19]:
AO[AO > 0]


Out[19]:
1950-02-28    0.626810
1950-04-30    0.555100
1950-05-31    0.071577
1950-06-30    0.538570
1950-09-30    0.357970
1951-07-31    0.090023
1951-12-31    1.987200
                ...   
2013-04-30    0.322210
2013-05-31    0.494010
2013-06-30    0.548650
2013-08-31    0.154250
2013-10-31    0.262760
2013-11-30    2.029000
2013-12-31    1.474900
dtype: float64

Data Frame

Now let's make live a bit more interesting and download more data. This will be NAO time series (Windowd users can get it here).


In [ ]:
!wget http://www.cpc.ncep.noaa.gov/products/precip/CWlink/pna/norm.nao.monthly.b5001.current.ascii

Create Series the same way as we did for AO:


In [21]:
nao = np.loadtxt('norm.nao.monthly.b5001.current.ascii')
dates_nao = pd.date_range('1950-01', '2014-01', freq='M')
NAO = pd.Series(nao[:768,2], index=dates_nao)

Time period is the same:


In [22]:
NAO.index


Out[22]:
DatetimeIndex(['1950-01-31', '1950-02-28', '1950-03-31', '1950-04-30',
               '1950-05-31', '1950-06-30', '1950-07-31', '1950-08-31',
               '1950-09-30', '1950-10-31', 
               ...
               '2013-03-31', '2013-04-30', '2013-05-31', '2013-06-30',
               '2013-07-31', '2013-08-31', '2013-09-30', '2013-10-31',
               '2013-11-30', '2013-12-31'],
              dtype='datetime64[ns]', length=768, freq='M', tz=None)

Now we create Data Frame, that will contain both AO and NAO data. It sort of an Excel table where the first row contain headers for the columns and firs column is an index:


In [23]:
aonao = pd.DataFrame({'AO' : AO, 'NAO' : NAO})

One can plot the data straight away:


In [24]:
aonao.plot()


Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x107b44250>

Or have a look at the first several rows:


In [25]:
aonao.head()


Out[25]:
AO NAO
1950-01-31 -0.060310 0.92
1950-02-28 0.626810 0.40
1950-03-31 -0.008127 -0.36
1950-04-30 0.555100 0.73
1950-05-31 0.071577 -0.59

We can reference each column by its name:


In [26]:
aonao['NAO']


Out[26]:
1950-01-31    0.92000
1950-02-28    0.40000
1950-03-31   -0.36000
1950-04-30    0.73000
1950-05-31   -0.59000
1950-06-30   -0.06000
1950-07-31   -1.26000
               ...   
2013-06-30    0.52076
2013-07-31    0.67216
2013-08-31    0.97019
2013-09-30    0.24060
2013-10-31   -1.28010
2013-11-30    0.90082
2013-12-31    0.94566
Freq: M, Name: NAO, dtype: float64

or as method of the Data Frame variable (if name of the variable is a valid python name):


In [27]:
aonao.NAO


Out[27]:
1950-01-31    0.92000
1950-02-28    0.40000
1950-03-31   -0.36000
1950-04-30    0.73000
1950-05-31   -0.59000
1950-06-30   -0.06000
1950-07-31   -1.26000
               ...   
2013-06-30    0.52076
2013-07-31    0.67216
2013-08-31    0.97019
2013-09-30    0.24060
2013-10-31   -1.28010
2013-11-30    0.90082
2013-12-31    0.94566
Freq: M, Name: NAO, dtype: float64

We can simply add column to the Data Frame:


In [28]:
aonao['Diff'] = aonao['AO'] - aonao['NAO']
aonao.head()


Out[28]:
AO NAO Diff
1950-01-31 -0.060310 0.92 -0.980310
1950-02-28 0.626810 0.40 0.226810
1950-03-31 -0.008127 -0.36 0.351872
1950-04-30 0.555100 0.73 -0.174900
1950-05-31 0.071577 -0.59 0.661577

And delete it:


In [29]:
del aonao['Diff']
aonao.tail()


Out[29]:
AO NAO
2013-08-31 0.15425 0.97019
2013-09-30 -0.46088 0.24060
2013-10-31 0.26276 -1.28010
2013-11-30 2.02900 0.90082
2013-12-31 1.47490 0.94566

Slicing will also work:


In [30]:
aonao['1981-01':'1981-03']


Out[30]:
AO NAO
1981-01-31 -0.11634 0.37
1981-02-28 -0.33158 0.92
1981-03-31 -1.64470 -1.19

even in some crazy combinations:


In [31]:
import datetime
aonao.ix[(aonao.AO > 0) & (aonao.NAO < 0) 
        & (aonao.index > datetime.datetime(1980,1,1)) 
        & (aonao.index < datetime.datetime(1989,1,1)),
        'NAO'].plot(kind='barh')


Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x107cf7810>

Here we use special advanced indexing attribute .ix. We choose all NAO values in the 1980s for months where AO is positive and NAO is negative, and then plot them. Magic :)

Statistics

Back to simple stuff. We can obtain statistical information over elements of the Data Frame. Default is column wise:


In [32]:
aonao.mean()


Out[32]:
AO    -0.129191
NAO   -0.028639
dtype: float64

In [33]:
aonao.max()


Out[33]:
AO     3.4953
NAO    3.0400
dtype: float64

In [34]:
aonao.min()


Out[34]:
AO    -4.2657
NAO   -3.1800
dtype: float64

You can also do it row-wise:


In [35]:
aonao.mean(1)


Out[35]:
1950-01-31    0.429845
1950-02-28    0.513405
1950-03-31   -0.184064
1950-04-30    0.642550
1950-05-31   -0.259211
1950-06-30    0.239285
1950-07-31   -1.031240
                ...   
2013-06-30    0.534705
2013-07-31    0.330524
2013-08-31    0.562220
2013-09-30   -0.110140
2013-10-31   -0.508670
2013-11-30    1.464910
2013-12-31    1.210280
Freq: M, dtype: float64

Or get everything at once:


In [36]:
aonao.describe()


Out[36]:
AO NAO
count 768.000000 768.000000
mean -0.129191 -0.028639
std 1.008903 1.000970
min -4.265700 -3.180000
25% -0.670015 -0.762500
50% -0.051368 0.000000
75% 0.467525 0.670000
max 3.495300 3.040000

By the way getting correlation coefficients for members of the Data Frame is as simple as:


In [37]:
aonao.corr()


Out[37]:
AO NAO
AO 1.000000 0.611757
NAO 0.611757 1.000000

Resampling

Pandas provide easy way to resample data to different time frequency. Two main parameters for resampling is time period you resemple to and the method that you use. By default the method is mean. Following example calculates annual ('A') mean:


In [38]:
AO_mm = AO.resample("A")
AO_mm.plot()


Out[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x107de2750>

median:


In [39]:
AO_mm = AO.resample("A", how='median')
AO_mm.plot()


Out[39]:
<matplotlib.axes._subplots.AxesSubplot at 0x10810a550>

You can use your methods for resampling, for example np.max (in this case we change resampling frequency to 3 years):


In [40]:
AO_mm = AO.resample("3A", how=np.max)
AO_mm.plot()


Out[40]:
<matplotlib.axes._subplots.AxesSubplot at 0x10824f650>

You can specify several functions at once as a list:


In [41]:
AO_mm = AO.resample("A", how=['mean', np.min, np.max])
#AO_mm['1900':'2020'].plot(subplots=True)
AO_mm['1900':'2020'].plot()


Out[41]:
<matplotlib.axes._subplots.AxesSubplot at 0x1084c3210>

That's it. I hope you at least get a rough impression of what pandas can do for you. Comments are very welcome (below). If you have intresting examples of pandas usage in Earth Science, we would be happy to put them on EarthPy.